
    Live, Personal Data Integration Through UI-Oriented Computing

    This paper proposes a new perspective on the problem of data integration on the Web: that of the Surface Web. It introduces UI-oriented computing, a computing paradigm whose core ingredients are the user interfaces that make up the Surface Web, and shows how a sensible mapping of data integration tasks to user interface elements and user interactions can cope with data integration scenarios that so far have only been conceived for the Deep Web, with its APIs and Web services. The described approach provides a novel conceptual and technological framework for practices, such as the integration of data APIs/services and the extraction of content from Web pages, that are common but still not adequately supported. The approach targets programmers and users alike and comes as an extensible, open-source browser extension.

    Research directions in data wrangling: Visualizations and transformations for usable and credible data

    In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process of ‘data wrangling’ often constitutes the most tedious and time-consuming aspect of analysis. Though data cleaning and integration are longstanding issues in the database community, relatively little research has explored how interactive visualization can advance the state of the art. In this article, we review the challenges and opportunities associated with addressing data quality issues. We argue that analysts might more effectively wrangle data through new interactive systems that integrate data verification, transformation, and visualization. We identify a number of outstanding research questions, including how appropriate visual encodings can facilitate apprehension of missing data, discrepant values, and uncertainty; how interactive visualizations might facilitate data transform specification; and how recorded provenance and social interaction might enable wider reuse, verification, and modification of data transformations.
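    Two of the diagnostic tasks the article highlights — spotting missing entries and discrepant values — can be illustrated with a minimal sketch (the column data, field name, and z-score threshold below are invented for illustration; real wrangling systems use far richer checks):

```python
# Hypothetical illustration of two data-quality diagnostics:
# flagging missing entries and statistical outliers in one column.

def diagnose(values, z_thresh=2.0):
    """Return indices of missing entries and of outliers beyond z_thresh."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    nums = [v for _, v in present]
    mean = sum(nums) / len(nums)
    std = (sum((v - mean) ** 2 for v in nums) / len(nums)) ** 0.5
    outliers = [i for i, v in present if std and abs(v - mean) / std > z_thresh]
    return missing, outliers

# A numeric column with two gaps and one discrepant reading
col = [10.2, 9.8, None, 10.5, 500.0, 10.1, None, 9.9]
missing, outliers = diagnose(col)
```

    An interactive system in the spirit of the article would render both index lists visually (e.g. highlighted cells) rather than returning them programmatically.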

    PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources

    Schema matching is a critical problem in many applications where the main goal is to match attributes coming from heterogeneous sources. In this paper, we propose PROCLAIM (PROfile-based Cluster-Labeling for AttrIbute Matching), an automatic, unsupervised clustering-based approach to match attributes across a large number of heterogeneous sources. We define the concept of an attribute profile to characterize the main properties of an attribute using: (i) the statistical distribution and the dimension of the attribute's values, and (ii) the name and textual descriptions related to the attribute. The attribute matchings produced by PROCLAIM give the best representation of the heterogeneous sources thanks to the cluster-labeling function we define. We evaluate PROCLAIM on 45,000 different data sources from the Oil and Gas Authority open data website. The results we obtain are promising and validate our approach.
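    The attribute-profile idea can be sketched as follows — an assumed simplification in which a profile is just value statistics plus name tokens, whereas PROCLAIM's actual profiles and cluster-labeling function are considerably richer (all attribute names and values below are fabricated):

```python
# Sketch of profile-based attribute matching: build a profile from
# (i) value statistics and (ii) name tokens, then compare profiles.
import statistics

def profile(name, values):
    return {
        "tokens": set(name.lower().replace("_", " ").split()),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }

def similar(p, q, tol=0.25):
    """Match if the names share a token and the value scales are close."""
    name_overlap = bool(p["tokens"] & q["tokens"])
    scale = max(abs(p["mean"]), abs(q["mean"]), 1e-9)
    close = abs(p["mean"] - q["mean"]) / scale < tol
    return name_overlap and close

a = profile("well_depth", [1200.0, 1350.0, 1100.0])
b = profile("depth_m", [1180.0, 1290.0, 1400.0])
c = profile("api_gravity", [34.0, 36.5, 29.8])
```

    Clustering all profiles by such a similarity measure, then labeling each cluster, yields one matched attribute per cluster across the sources.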

    Fuzzy Semantic Labeling of Semi-structured Numerical Datasets

    SPARQL endpoints provide access to rich sources of data (e.g. knowledge graphs), which can be used to classify other, less structured datasets (e.g. CSV files or HTML tables on the Web). We propose an approach to suggest types for the numerical columns of a collection of input files available as CSVs. Our approach is based on applying the fuzzy c-means clustering technique to numerical data in the input files, using existing SPARQL endpoints to generate training datasets. Our approach has three major advantages: it works directly with live knowledge graphs, it does not require knowledge-graph profiling beforehand, and it avoids tedious and costly manual training to match values with types. We evaluate our approach against manually annotated datasets. The results show that the proposed approach classifies most of the types correctly for our test sets.
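    The clustering step the paper builds on can be shown with a minimal one-dimensional fuzzy c-means sketch (the actual system derives its training data from live SPARQL endpoints; the values below, standing in for e.g. "year" vs. "percentage" columns, are made up):

```python
# Minimal 1-D fuzzy c-means: soft memberships u[i, k] and
# membership-weighted center updates, iterated to convergence.
import numpy as np

def fuzzy_cmeans(x, c=2, m=2.0, iters=50):
    centers = np.linspace(x.min(), x.max(), c)  # deterministic init
    for _ in range(iters):
        d = np.abs(x[:, None] - centers[None, :]) + 1e-12
        # u[i, k] = 1 / sum_j (d[i, k] / d[i, j]) ** (2 / (m - 1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # c[k] = sum_i u[i, k]**m * x[i] / sum_i u[i, k]**m
        centers = (u.T ** m @ x) / np.sum(u.T ** m, axis=1)
    return centers, u

x = np.array([1990.0, 2001.0, 2015.0, 3.5, 12.0, 47.5])
centers, u = fuzzy_cmeans(x)
```

    A new numerical column would then be assigned the type whose cluster center its values lie closest to, with the membership degree serving as a confidence score.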

    Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings

    Web tables constitute valuable sources of information for various applications, ranging from Web search to Knowledge Base (KB) augmentation. An underlying common requirement is to annotate the rows of Web tables with semantically rich descriptions of entities published in Web KBs. In this paper, we evaluate three unsupervised annotation methods: (a) a lookup-based method, which relies on the minimal entity context provided in Web tables to discover correspondences to the KB; (b) a semantic embeddings method, which exploits a vectorial representation of the rich entity context in a KB to identify the most relevant subset of entities in the Web table; and (c) an ontology matching method, which exploits schematic and instance information of entities available in both a KB and a Web table. Our experimental evaluation is conducted using two existing benchmark data sets in addition to a new large-scale benchmark created using Wikipedia tables. Our results show that: (1) our novel lookup-based method outperforms state-of-the-art lookup-based methods; (2) the semantic embeddings method outperforms lookup-based methods on one benchmark data set; and (3) the lack of a rich schema in Web tables can limit the ability of ontology matching tools to perform high-quality table annotation. As a result, we propose a hybrid method that significantly outperforms the individual methods on all the benchmarks.
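    The contrast between the first two strategies can be sketched in miniature — exact-label lookup against cosine similarity over entity vectors (the KB labels, URIs, and 3-dimensional "embeddings" below are fabricated for illustration; real systems use full KB indexes and high-dimensional embeddings):

```python
# Toy contrast: lookup-based vs. embedding-based entity annotation.
import math

kb_labels = {"berlin": "dbr:Berlin", "paris": "dbr:Paris"}
kb_vectors = {
    "dbr:Berlin": [0.9, 0.1, 0.2],
    "dbr:Paris": [0.8, 0.2, 0.1],
    "dbr:Paris_Hilton": [0.1, 0.9, 0.7],
}

def lookup(cell):
    """Exact-label lookup: brittle, but precise when it hits."""
    return kb_labels.get(cell.strip().lower())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def embed_match(cell_vec):
    """Embedding match: nearest KB entity by cosine similarity."""
    return max(kb_vectors, key=lambda e: cosine(cell_vec, kb_vectors[e]))

uri = lookup("Berlin")                # exact-label hit
best = embed_match([0.8, 0.2, 0.1])   # nearest entity by cosine
```

    The lookup path fails on any surface-form variation, while the embedding path always returns *some* entity; a hybrid, as the paper proposes, can fall back from one to the other.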